{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# Copyright 2019 Google LLC\n",
    "#\n",
    "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "# https://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Overview\n",
    "This notebook uses the [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) to demonstrate how to train a model and generate local predictions using XGBoost."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "##  Dataset\n",
    "The [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) that this sample\n",
    "uses for training is provided by the [UC Irvine Machine Learning\n",
    "Repository](https://archive.ics.uci.edu/ml/datasets/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Disclaimer\n",
    "This dataset is provided by a third party. Google provides no representation,\n",
    "warranty, or other guarantees about the validity or any other aspects of this dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Install Packages and dependencies\n",
    "\n",
    "Install addional dependencies not installed in your Notebook environment\n",
    "(e.g. XGBoost, adanet, tf-hub)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "%pip install xgboost"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Build your model\n",
    "\n",
    "We will create an XGBoost model and then perform local predictions, proceed to define \n",
    "imports."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import datetime\n",
    "import os\n",
    "\n",
    "import pandas as pd\n",
    "import xgboost as xgb\n",
    "import numpy as np\n",
    "\n",
    "from sklearn.base import BaseEstimator, TransformerMixin\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.pipeline import FeatureUnion, make_pipeline\n",
    "\n",
    "import warnings\n",
    "warnings.filterwarnings(action='ignore', category=DeprecationWarning)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "### Download the Data\n",
    "\n",
    "We can simply download the dataset from [UC Irvine Machine Learning\n",
    "Repository](https://archive.ics.uci.edu/ml/datasets/) to our local machine:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "!wget https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Loading the Data\n",
    "\n",
    "We'll use Pandas to load the dataset as a DataFrame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "census_data_filename = './adult.data'\n",
    "\n",
    "# These are the column labels from the census data files\n",
    "COLUMNS = (\n",
    "    'age',\n",
    "    'workclass',\n",
    "    'fnlwgt',\n",
    "    'education',\n",
    "    'education-num',\n",
    "    'marital-status',\n",
    "    'occupation',\n",
    "    'relationship',\n",
    "    'race',\n",
    "    'sex',\n",
    "    'capital-gain',\n",
    "    'capital-loss',\n",
    "    'hours-per-week',\n",
    "    'native-country',\n",
    "    'income-level'\n",
    ")\n",
    "\n",
    "# Load the training census dataset\n",
    "with open(census_data_filename, 'r') as train_data:\n",
    "    raw_training_data = pd.read_csv(train_data, header=None, names=COLUMNS)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now, let's take a look at the data to have a better understanding of it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "raw_training_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "First, let's separate the features and the target and convert them to numpy objects:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "raw_features = raw_training_data.drop('income-level', axis=1).values\n",
    "# Create training labels list\n",
    "train_labels = (raw_training_data['income-level'] == ' >50K').values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Data Preprocessing\n",
    "\n",
    "The features are a combination of both numerical and categorical values. As a part of data preparation before we can feed the data to the modell, we will need to convert the categorical features to numerical. We will use scikit-learn libraries to prepare the data.\n",
    "\n",
    "### Why scikit-learn?\n",
    "scikit-learn has an amazing API to create and train a pipeline to preprocess the data before feeding to the model. We will use a custom pipeline in this notebook to prepare the data for XGBoost:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "class PositionalSelector(BaseEstimator, TransformerMixin):\n",
    "    def __init__(self, positions):\n",
    "        self.positions = positions\n",
    "\n",
    "    def fit(self, X, y=None):\n",
    "        return self\n",
    "\n",
    "    def transform(self, X):\n",
    "        return np.array(X)[:, self.positions]\n",
    "\n",
    "\n",
    "class StripString(BaseEstimator, TransformerMixin):\n",
    "    def fit(self, X, y=None):\n",
    "        return self\n",
    "\n",
    "    def transform(self, X):\n",
    "        strip = np.vectorize(str.strip)\n",
    "        return strip(np.array(X))\n",
    "\n",
    "\n",
    "class SimpleOneHotEncoder(BaseEstimator, TransformerMixin):\n",
    "    def fit(self, X, y=None):\n",
    "        self.values = []\n",
    "        for c in range(X.shape[1]):\n",
    "            Y = X[:, c]\n",
    "            values = {v: i for i, v in enumerate(np.unique(Y))}\n",
    "            self.values.append(values)\n",
    "        return self\n",
    "\n",
    "    def transform(self, X):\n",
    "        X = np.array(X)\n",
    "        matrices = []\n",
    "        for c in range(X.shape[1]):\n",
    "            Y = X[:, c]\n",
    "            matrix = np.zeros(shape=(len(Y), len(self.values[c])), dtype=np.int8)\n",
    "            for i, x in enumerate(Y):\n",
    "                if x in self.values[c]:\n",
    "                    matrix[i][self.values[c][x]] = 1\n",
    "            matrices.append(matrix)\n",
    "        res = np.concatenate(matrices, axis=1)\n",
    "        return res"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "To simplify things a little, we create a pipeline object that only uses the following features:\n",
    "* Categorical: workclass, education, marital-status, and relationship\n",
    "* Numerical: age and hours-per-week\n",
    "\n",
    "Now we can create a pipeline object and train it to process our data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# Categorical features: age and hours-per-week\n",
    "# Numerical features: workclass, marital-status, and relationship\n",
    "numerical_indices = [0, 12]  # age-num, and hours-per-week\n",
    "categorical_indices = [1, 3, 5, 7]  # workclass, education, marital-status, and relationship\n",
    "\n",
    "p1 = make_pipeline(PositionalSelector(categorical_indices), \n",
    "                   StripString(), \n",
    "                   SimpleOneHotEncoder())\n",
    "p2 = make_pipeline(PositionalSelector(numerical_indices), \n",
    "                   StandardScaler())\n",
    "\n",
    "pipeline = FeatureUnion([\n",
    "    ('numericals', p1),\n",
    "    ('categoricals', p2),\n",
    "])\n",
    "\n",
    "train_features = pipeline.fit_transform(raw_features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Our dataset is ready for training the model now:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# train the model\n",
    "model = xgb.XGBClassifier(max_depth=4)\n",
    "model.fit(train_features, train_labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Once we train the model, we can simply just save it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# save the mode\n",
    "model.save_model('model.bst')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Predictions\n",
    "In order to make prediction, we need both `pipline` and `model` objects:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "instances = [[\n",
    "    42, ' State-gov', 77516, ' Bachelors', 13, ' Never-married',\n",
    "    ' Adm-clerical', ' Not-in-family', ' White', ' Male', 2174, 0, 40,\n",
    "    ' United-States'\n",
    "],\n",
    "    [\n",
    "        50, ' Self-emp-not-inc', 83311, ' Bachelors', 13,\n",
    "        ' Married-civ-spouse', ' Exec-managerial', ' Husband',\n",
    "        ' White', ' Male', 0, 0, 10, ' United-States'\n",
    "    ]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "First, we need to preprocess the instances:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "processed_instances = pipeline.transform(instances)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Then we'll pass the processed data to the model for classification:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "model.predict(processed_instances)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
